Low cost page quality factors to detect web spam

Authors

  • Ashish Chandra
  • Mohammad Suaib
  • Md. Rizwan Beg
Abstract

Web spam is a major challenge to the quality of search engine results, so it is important for search engines to detect it accurately. In this paper we present 32 low-cost quality factors for classifying spam and ham pages in real time. These features fall into three categories: (i) URL features, (ii) content features, and (iii) link features. We developed a classifier using the Resilient Back-propagation learning algorithm for neural networks and obtained good accuracy. The classifier can be applied to search engine results in real time because computing these features requires very little CPU time.
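
The abstract does not list the 32 factors, so the following is only a minimal sketch of what cheap features in the three named categories might look like. Every feature name below (`url_length`, `host_dots`, `external_link_ratio`, and so on) and the helper `extract_features` are hypothetical illustrations, not the authors' feature set. Each value is computed in a single pass over the URL string and the page HTML using only the Python standard library, which is the kind of cost profile the abstract describes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _PageScanner(HTMLParser):
    """Counts visible words and collects href targets in one pass."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.hrefs = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore text that belongs to scripts or stylesheets.
        if not self._skip:
            self.words += len(data.split())


def extract_features(url, html):
    """Return a dict of hypothetical URL, content, and link features."""
    host = urlparse(url).hostname or ""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()

    # A link is "external" if it resolves to a different host.
    external = 0
    for href in scanner.hrefs:
        target = urlparse(urljoin(url, href)).hostname
        if target and target != host:
            external += 1
    n_links = len(scanner.hrefs)

    return {
        # URL features (assumed examples)
        "url_length": len(url),
        "host_dots": host.count("."),
        "url_digits": sum(ch.isdigit() for ch in url),
        "url_hyphens": url.count("-"),
        # Content feature (assumed example)
        "word_count": scanner.words,
        # Link features (assumed examples)
        "link_count": n_links,
        "external_link_ratio": external / n_links if n_links else 0.0,
    }
```

A feature vector built this way could then be fed to any classifier; the paper itself reports using a neural network trained with Resilient Back-propagation, which this sketch does not implement.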

Similar resources

A Novel Approach for Combating Spamdexing in Web using UCINET and SVM Light Tool

Search Engine spam is a web page or a portion of a web page which has been created with the intention of increasing its ranking in search engines. Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. Anyone who uses a search engine frequently has most likely encountered a high ranking page that consists of nothing more than a bu...

A Spamicity Approach to Web Spam Detection

Web spam, which refers to any deliberate actions bringing to selected web pages an unjustifiable favorable relevance or importance, is one of the major obstacles for high quality information retrieval on the web. Most of the existing web spam detection methods are supervised that require a large and representative training set of web pages. Moreover, they often assume some global information su...

Multi-View Learning for Web Spam Detection

Spam pages are designed to maliciously appear among the top search results by excessive usage of popular terms. Therefore, spam pages should be removed using an effective and efficient spam detection system. Previous methods for web spam classification used several features from various information sources (page contents, web graph, access logs, etc.) to detect web spam. In this paper, we follo...

Anti-Trust Rank for Detection of Web Spam and Seed Set Expansion

In the recent times, the Web has been the most popular and perhaps the most efficient platform for sharing, storing as well as retrieving information. Finding the required information from the Web is facilitated by search engines. Search engines form the interface between the Web and the users. Given the vast amount of information available on the Web, search engines must pick a small subset of...

Identifying Spam Web Pages Based on Content Similarity

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless infor...

Journal:
  • CoRR

Volume: abs/1410.2085

Publication date: 2014